Introduction

We will continue to visualize the RNA-Seq dataset of CD8+ T cells that do not express the transcription factor Tcf1 (encoded by Tcf7) versus cells that do (WT = wildtype). Our goal is to represent this data in various forms over the next few sessions.

Last week, we generated plots (PCA/correlation heatmap) to visualize the quality of the samples. Today, we will work on visualizing group of genes identified as differentially expressed (DEGs) between the two groups being compared in this study (Tcf7_KO versus WT).

The volcano plot is a scatterplot that visualizes both the (log2) fold change and level of significance (usually −log10[p−value]). This data visualization enables the display of a large amount of data in a concise way. Typically, only a handful of datapoints in the volcano plot are of interest, also known as hits. The hits are datapoints that surpass a threshold that is defined by the user for both the significance and fold change. Annotation of hits (with names) is used to draw attention to these most relevant or interesting datapoints.

We will be generating volcano plots using:

  1. ggplots2: an data visualization package that is offered via the tidyverse collection

  2. EnhancedVolcano: a package created by Kevin Blighe, Sharmila Rana, and Myles Lewis to generate publication-ready volcano plots.

Load the required libraries

Remember to first check if you require a package for this tutorial using library(). Once the package is installed there is no need to reinstall it over and over again.

#BiocManager::install('EnhancedVolcano') #if you need to install
#BiocManager::install("biomaRt") #if you need to install

library(EnhancedVolcano)
library(dplyr)
library(tidyr)
library(ggplot2)
library(genekitr)

Part I: Volcano plot w/ ggplot2

We will generate a volcano plot with ggplot2 that will take inspiration from this article.

deBock.et.al.2020

For the volcano plot above we can see that the authors are coloring “genes of significance” by p-adjusted values only. Whereby, genes that have a p-adjusted value of less than 0.05 are colored red and those that do not meet this threshold are being colored as black. Overall, this is perfectly fine to do as the authors wrote their methods transparently and leave no confusion with how the data was plotted.

Prepping the data

resdata <- read.csv("TCF1KOvsWT_unfiltered_matrix.csv", header = T)
head(resdata)
##                 Gene_id baseMean log2FoldChange      lfcSE      stat pvalue
## 1 ENSMUSG00000015143.16 6770.207      -2.946559 0.05835389 -50.49464      0
## 2  ENSMUSG00000015437.6 1365.097       6.213701 0.15815498  39.28868      0
## 3 ENSMUSG00000021756.13 5340.423      -1.932083 0.04966316 -38.90376      0
## 4  ENSMUSG00000025163.7 4445.134       2.453194 0.06245931  39.27668      0
## 5  ENSMUSG00000035042.3 2816.472       6.878698 0.14338097  47.97497      0
## 6 ENSMUSG00000039521.14 1506.650       6.560086 0.16962943  38.67304      0
##   padj WT_CD8_rep1 WT_CD8_rep2 WT_CD8_rep3 TCF1_KO_CD8_rep1 TCF1_KO_CD8_rep2
## 1    0 12433.34924 11759.60918 11762.92037         1518.717         1477.525
## 2    0    30.97347    42.89393    34.62547         2565.046         2960.111
## 3    0  8406.79850  8492.04492  8490.81374         2237.728         2187.952
## 4    0  1421.78206  1263.94113  1434.79279         7804.848         7529.308
## 5    0    48.95806    40.03433    54.10229         5114.678         5337.308
## 6    0    32.97176    32.40875    29.21524         2819.828         3377.057
##   TCF1_KO_CD8_rep3
## 1         1669.121
## 2         2556.930
## 3         2227.202
## 4         7216.133
## 5         6303.749
## 6         2748.418
str(resdata)
## 'data.frame':    19401 obs. of  13 variables:
##  $ Gene_id         : chr  "ENSMUSG00000015143.16" "ENSMUSG00000015437.6" "ENSMUSG00000021756.13" "ENSMUSG00000025163.7" ...
##  $ baseMean        : num  6770 1365 5340 4445 2816 ...
##  $ log2FoldChange  : num  -2.95 6.21 -1.93 2.45 6.88 ...
##  $ lfcSE           : num  0.0584 0.1582 0.0497 0.0625 0.1434 ...
##  $ stat            : num  -50.5 39.3 -38.9 39.3 48 ...
##  $ pvalue          : num  0 0 0 0 0 ...
##  $ padj            : num  0 0 0 0 0 ...
##  $ WT_CD8_rep1     : num  12433 31 8407 1422 49 ...
##  $ WT_CD8_rep2     : num  11759.6 42.9 8492 1263.9 40 ...
##  $ WT_CD8_rep3     : num  11762.9 34.6 8490.8 1434.8 54.1 ...
##  $ TCF1_KO_CD8_rep1: num  1519 2565 2238 7805 5115 ...
##  $ TCF1_KO_CD8_rep2: num  1478 2960 2188 7529 5337 ...
##  $ TCF1_KO_CD8_rep3: num  1669 2557 2227 7216 6304 ...

Right now, there is not a column which indicates whether the gene is considered differentially expressed based on p-adjusted values alone. We will add this in.

threshold_padj <- resdata$padj < 0.05 
length(which(threshold_padj)) ## Which values in an object are considered true and what is the length?? 
## [1] 4652

We now have created a logical vector. A TRUE value will correspond to genes that meet the criteria of having a padj of < 0.05 while FALSE means it is > 0.05 (or fails this criteria).

To add this vector to our results table we can use the $ notation on the left hand side of the assignment operator to create a column:

resdata$threshold <- threshold_padj ## Add vector as a column (threshold) to the resdata
str(resdata)
## 'data.frame':    19401 obs. of  14 variables:
##  $ Gene_id         : chr  "ENSMUSG00000015143.16" "ENSMUSG00000015437.6" "ENSMUSG00000021756.13" "ENSMUSG00000025163.7" ...
##  $ baseMean        : num  6770 1365 5340 4445 2816 ...
##  $ log2FoldChange  : num  -2.95 6.21 -1.93 2.45 6.88 ...
##  $ lfcSE           : num  0.0584 0.1582 0.0497 0.0625 0.1434 ...
##  $ stat            : num  -50.5 39.3 -38.9 39.3 48 ...
##  $ pvalue          : num  0 0 0 0 0 ...
##  $ padj            : num  0 0 0 0 0 ...
##  $ WT_CD8_rep1     : num  12433 31 8407 1422 49 ...
##  $ WT_CD8_rep2     : num  11759.6 42.9 8492 1263.9 40 ...
##  $ WT_CD8_rep3     : num  11762.9 34.6 8490.8 1434.8 54.1 ...
##  $ TCF1_KO_CD8_rep1: num  1519 2565 2238 7805 5115 ...
##  $ TCF1_KO_CD8_rep2: num  1478 2960 2188 7529 5337 ...
##  $ TCF1_KO_CD8_rep3: num  1669 2557 2227 7216 6304 ...
##  $ threshold       : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

Plotting with ggplot2

We will begin by using the ggplot() function to initialize the basic graph structure. The basic idea is that you can specify different parts of the plot and add them together using the + operator. These parts are often referred to as layers.

As input ggplot() requires a data frame

Let’s start:

#ggplot(resdata) #what happens? 

Notice, that you will get a blank plot because ggplot2 requires the user to specify layers using the + operator.

One type of layer is called geometric objects. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for boxplots!)

Any plot created with ggplot() must have at least one geom. You can add a geom to a plot using the + operator

#ggplot(resdata) +
  #geom_point() # note what happens here

You will find that even though we have added a layer by specifying geom_point, we still get an error. This is because each type of geom usually has a required set of aesthetics to be set. Aesthetic mappings are set with the aes() function and can be set inside geom_point(). If we supplied aesthetics within ggplot(), they will be used as defaults for every layer. Examples of aesthetics include:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • line type
  • size

We will begin by specifying the x- and y-axis since geom_point() requires this information for a scatter plot.

ggplot(resdata) +
  geom_point(aes(x = log2FoldChange, 
                 y = -log10(padj)))

Now, let’s move to customizing the appearance of the data points by adding color. We can color the points (i.e. those which represent significant genes) based on threshold column we created earlier.

ggplot(resdata) +
        geom_point(aes(x=log2FoldChange, y=-log10(padj),
                       colour=threshold))

Lets take a second to change the look of the non-data plotting elements, such as the grid and labels. Many of these elements can be altered within a theme() layer.

The ggplot2 theme system handles non-data plot elements such as: + axis titles and label style + plot background + presence of major or minor gridlines + legend appearance

There are a few build-in themes we can use that will change the background/foreground colors by adding it as an additional layer. Themes can be found here

I like the classic theme, so let’s add it in.

ggplot(resdata) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic()

Let’s edit the title & x and y labels, including its size, text, and position. In addition, we want to remove the legend completely.

ggplot(resdata) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic() + 
        ggtitle("TCF7 KO versus WT") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) 

We are almost there but the colors are not right. We can fix this with scale_color_manual and change the values to black and red like the article.

In addition, we are adding the parameter scale_x_continuous to set the x-axis limit. This is the default scales, we are simply adjusting it so that the plot is centered. We are also using coord_cartesian function. Setting “limits” with the Cartesian coordinate system will zoom the plot but will not change the underlying data. Here we are not changing the plot but are using the “clip” argument and setting it to “no”. This will allow the ‘outlier’ points to be drawn on the plot.

ggplot(resdata) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=threshold)) + 
  theme_classic() + 
        ggtitle("TCF7 KO versus WT") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "none",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("black", "red")) +
  scale_x_continuous(limits = c(-20, 20)) + 
  coord_cartesian(ylim=c(0, 300), clip = "off")                                            

This is a great way to get a global picture of what is going on, but what if wanted to label our plot with gene names from our DE list? (Notice, that the authors labeled HOXA9 on their plot.) HOXA9 is a gene symbol. However, this is where we hit a road block since our gene ids are currently listed as ENSEMBL gene ids.

head(resdata)
##                 Gene_id baseMean log2FoldChange      lfcSE      stat pvalue
## 1 ENSMUSG00000015143.16 6770.207      -2.946559 0.05835389 -50.49464      0
## 2  ENSMUSG00000015437.6 1365.097       6.213701 0.15815498  39.28868      0
## 3 ENSMUSG00000021756.13 5340.423      -1.932083 0.04966316 -38.90376      0
## 4  ENSMUSG00000025163.7 4445.134       2.453194 0.06245931  39.27668      0
## 5  ENSMUSG00000035042.3 2816.472       6.878698 0.14338097  47.97497      0
## 6 ENSMUSG00000039521.14 1506.650       6.560086 0.16962943  38.67304      0
##   padj WT_CD8_rep1 WT_CD8_rep2 WT_CD8_rep3 TCF1_KO_CD8_rep1 TCF1_KO_CD8_rep2
## 1    0 12433.34924 11759.60918 11762.92037         1518.717         1477.525
## 2    0    30.97347    42.89393    34.62547         2565.046         2960.111
## 3    0  8406.79850  8492.04492  8490.81374         2237.728         2187.952
## 4    0  1421.78206  1263.94113  1434.79279         7804.848         7529.308
## 5    0    48.95806    40.03433    54.10229         5114.678         5337.308
## 6    0    32.97176    32.40875    29.21524         2819.828         3377.057
##   TCF1_KO_CD8_rep3 threshold
## 1         1669.121      TRUE
## 2         2556.930      TRUE
## 3         2227.202      TRUE
## 4         7216.133      TRUE
## 5         6303.749      TRUE
## 6         2748.418      TRUE
resdata$Gene_id <- sub("\\.\\d+", "", resdata$Gene_id)
ids <- resdata$Gene_id
gene_symbol_map <- transId(ids, transTo = "symbol", org = "mouse")

The option we used above was to specify the pattern as “.”, followed by one or more digits (\d+) at the end of the string and replace with blank (’"). regex

names(resdata)[1] = "input_id"

resdata_gene <- merge(resdata, gene_symbol_map, by.x = "input_id", by.y = "input_id")

str(resdata_gene)
## 'data.frame':    17276 obs. of  15 variables:
##  $ input_id        : chr  "ENSMUSG00000000001" "ENSMUSG00000000028" "ENSMUSG00000000037" "ENSMUSG00000000056" ...
##  $ baseMean        : num  3904.96 66.3 4.94 833.78 25622.4 ...
##  $ log2FoldChange  : num  0.0747 0.3184 -0.2252 -0.2261 0.3501 ...
##  $ lfcSE           : num  0.0569 0.2499 0.9064 0.0819 0.1706 ...
##  $ stat            : num  1.311 1.274 -0.248 -2.76 2.052 ...
##  $ pvalue          : num  0.18981 0.2026 0.8038 0.00578 0.0402 ...
##  $ padj            : num  0.3838 0.401 0.9057 0.0241 0.119 ...
##  $ WT_CD8_rep1     : num  3.60e+03 5.89e+01 9.99e-01 9.47e+02 1.80e+04 ...
##  $ WT_CD8_rep2     : num  4013.92 53.38 9.53 850.25 25351.27 ...
##  $ WT_CD8_rep3     : num  3795.82 64.92 5.41 899.18 24233.5 ...
##  $ TCF1_KO_CD8_rep1: num  4075.6 78.88 4.53 737.14 32210.41 ...
##  $ TCF1_KO_CD8_rep2: num  3943.78 56.67 5.06 767.1 27000.26 ...
##  $ TCF1_KO_CD8_rep3: num  3998.7 85 4.1 801.8 26937.4 ...
##  $ threshold       : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ symbol          : chr  "Gnai3" "Cdc45" "Scml2" "Narf" ...
#Add a column to dataframe to specify up or down regulated
resdata_gene$diffexpressed <- "Unchanged"

#if log2FC > 1 and padj < 0.05 set as upregulated
resdata_gene$diffexpressed[resdata_gene$padj < 0.05 & resdata_gene$log2FoldChange > 1] <- "Upregulated"

#if log2FC > -1 and padj < 0.05 set as downregulated
resdata_gene$diffexpressed[resdata_gene$padj < 0.05 & resdata_gene$log2FoldChange < -1] <- "Downregulated"

Figure_Volcano

ggplot(resdata_gene) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=diffexpressed)) + 
  theme_classic() + 
        ggtitle("TCF7 KO versus WT") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("blue", "black", "red"),
                     labels = c("Downregulated", "Not Significant", "Upregulated")) +
  scale_x_continuous(limits = c(-20, 20)) + 
  coord_cartesian(ylim=c(0, 300), clip = "off")  

resdata_sig <- as.data.frame(resdata_gene) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(resdata_sig)
##             input_id   baseMean log2FoldChange      lfcSE       stat
## 1 ENSMUSG00000000301   36.06472      -1.245867 0.34387653  -3.623008
## 2 ENSMUSG00000000303  740.39042       4.274595 0.35738885  11.960628
## 3 ENSMUSG00000000489   96.63058       2.107325 0.23386230   9.010966
## 4 ENSMUSG00000000530  127.87156      -2.016456 0.19967045 -10.098920
## 5 ENSMUSG00000000631 1771.29476       1.183873 0.07321316  16.170217
## 6 ENSMUSG00000000817 1200.51462       6.961123 0.49820582  13.972385
##        pvalue        padj WT_CD8_rep1 WT_CD8_rep2 WT_CD8_rep3 TCF1_KO_CD8_rep1
## 1 2.91197e-04 1.75073e-03   44.961485    48.61312    58.43048         14.50715
## 2 5.71000e-33 3.32000e-31   92.920402    57.19191    68.16889       1068.08907
## 3 2.04000e-19 5.88000e-18   37.967476    33.36195    37.87160        132.37776
## 4 5.59000e-24 2.15000e-22  225.806568   216.37605   173.12733         56.21521
## 5 8.18000e-59 1.24000e-56 1067.085904  1046.61189  1136.14814       2619.44765
## 6 2.30000e-44 2.15000e-42    5.994865    28.59595    22.72296       2572.29940
##   TCF1_KO_CD8_rep2 TCF1_KO_CD8_rep3 threshold symbol diffexpressed
## 1         25.30009         24.57602      TRUE   Pemt Downregulated
## 2       1986.56336       1169.40889      TRUE   Cdh1   Upregulated
## 3        148.76455        189.44014      TRUE  Pdgfb   Upregulated
## 4         46.55217         49.15204      TRUE Acvrl1 Downregulated
## 5       2509.76930       2248.70571      TRUE Myo18a   Upregulated
## 6       2394.40087       2179.07365      TRUE   Fasl   Upregulated
#write.csv(resdata_sig, file="Tcf7_ko_versus_wt_DEG.csv")

# Count the number of rows where diffexpressed is "Upregulated"
upregulated_count <- nrow(resdata_sig[resdata_sig$diffexpressed == "Upregulated", ])
upregulated_count
## [1] 560
downregulated_count <- nrow(resdata_sig[resdata_sig$diffexpressed == "Downregulated", ])
downregulated_count
## [1] 412
p <- ggplot(resdata_gene) +
        geom_point(aes(x=log2FoldChange, 
                       y=-log10(padj), colour=diffexpressed)) + 
  theme_classic() + 
        ggtitle("TCF7 KO versus WT") +
        xlab("Log2 fold change") + 
        ylab("-Log10(padj)") +
        theme(legend.position = "right",
              plot.title = element_text(size = rel(1.25), hjust = 0.5), 
              axis.title = element_text(size = rel(1.25))) +
  scale_color_manual(values = c("blue", "black", "red"),
                     labels = c("Downregulated", "Not Significant", "Upregulated")) +
  scale_x_continuous(limits = c(-20, 20)) + 
  coord_cartesian(ylim=c(0, 300), clip = "off") 

  # Add annotation
  p + annotate("text", x = 15, y = 10, label = "560", colour = "red", size = 8, hjust = 0) +
  annotate("text", x = -15, y = 10, label = "412", colour = "blue", size = 8, hjust = 1)


Part II: Volcano plot w/ EnhancedVolcano

In its most basic usage, EnhancedVolcano requires: + data-frame (i.e. resdata_gene), + a column of rownames (i.e. lab =), + a column name containing log2 fold changes (i.e. x =log2FoldChange) + a column name containing nominal or adjusted pvalues (i.e. y = padj)

There are numerous arguments that can be used to customize the volcano plot with EnhancedVolcano. We will begin with the most fundamental and most important of these options which includes visualizing sections of the volcano plot based off p adjusted cutoff of < 0.05 and absolute log2 of 1. For both, vertical and horizontal lines will be drawn to show these cutoffs exactly where we want them. Note, this was done in the plot above, however were you confident where these lines were drawn?

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$symbol),
                x = 'log2FoldChange',
                y = 'padj', 
                title = "TCF7_KO versus WT",
                FCcutoff = 1,
                pCutoff = 0.05)

Now, lets convert our colors to red and black like the article. Naturally, EnhancedVolcano will color based on four quadrants. These quadrants are shown (in order):
1) NS - shown in black 2) Log2FC - shown in green 3) p-value - shown in blue 4) p-value & Log2FC - shown in red

In addition, let’s move the legend position to the right of the plot.

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$symbol),
                x = 'log2FoldChange',
                y = 'padj', 
                title = "TCF7_KO versus WT",
                subtitle = "Differential expression", 
                FCcutoff = 1,
                pCutoff = 0.05,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0)

There might be too many genes labeled here making the plot look cluttered. Let’s label a few select DE genes. First, lets get a complete of list of our DEGs using:

resdata_sig <- as.data.frame(resdata_gene) %>% dplyr::filter(abs(log2FoldChange) > 1 & padj < 0.05)
head(resdata_sig)
##             input_id   baseMean log2FoldChange      lfcSE       stat
## 1 ENSMUSG00000000301   36.06472      -1.245867 0.34387653  -3.623008
## 2 ENSMUSG00000000303  740.39042       4.274595 0.35738885  11.960628
## 3 ENSMUSG00000000489   96.63058       2.107325 0.23386230   9.010966
## 4 ENSMUSG00000000530  127.87156      -2.016456 0.19967045 -10.098920
## 5 ENSMUSG00000000631 1771.29476       1.183873 0.07321316  16.170217
## 6 ENSMUSG00000000817 1200.51462       6.961123 0.49820582  13.972385
##        pvalue        padj WT_CD8_rep1 WT_CD8_rep2 WT_CD8_rep3 TCF1_KO_CD8_rep1
## 1 2.91197e-04 1.75073e-03   44.961485    48.61312    58.43048         14.50715
## 2 5.71000e-33 3.32000e-31   92.920402    57.19191    68.16889       1068.08907
## 3 2.04000e-19 5.88000e-18   37.967476    33.36195    37.87160        132.37776
## 4 5.59000e-24 2.15000e-22  225.806568   216.37605   173.12733         56.21521
## 5 8.18000e-59 1.24000e-56 1067.085904  1046.61189  1136.14814       2619.44765
## 6 2.30000e-44 2.15000e-42    5.994865    28.59595    22.72296       2572.29940
##   TCF1_KO_CD8_rep2 TCF1_KO_CD8_rep3 threshold symbol diffexpressed
## 1         25.30009         24.57602      TRUE   Pemt Downregulated
## 2       1986.56336       1169.40889      TRUE   Cdh1   Upregulated
## 3        148.76455        189.44014      TRUE  Pdgfb   Upregulated
## 4         46.55217         49.15204      TRUE Acvrl1 Downregulated
## 5       2509.76930       2248.70571      TRUE Myo18a   Upregulated
## 6       2394.40087       2179.07365      TRUE   Fasl   Upregulated
write.csv(resdata_sig, file="Tcf7_ko_versus_wt_DEG.csv")

Looking at the supplementary data, the authors identified some genes such as ‘Cd19’, ‘Ccr7’, and ‘Gzma’ as biologically significant to their findings. In addition, this article focusing on two transcription factors ‘Tcf7’ and ‘Lef1’ so it would be a good idea to view those. I also am including ‘Actn1’ as more of positive control. It is the most signficantly downregulated gene in our dataset and I want to make sure it shows up on the correct side of the graph.

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$symbol),
                x = 'log2FoldChange',
                y = 'padj', 
                selectLab = c('Ccr7', 'Cd19', 'Actn1', 'Gzma', 'Ccl5', 'Tcf7', 'Lef1'),
                title = "TCF7_KO versus WT",
                subtitle = "Differential expression", 
                FCcutoff = 1,
                pCutoff = 0.05,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black')

It looks like a few genes are hard to read. We can fix this by adding a boxed label. This will make the gene name pop out and will be given a white background. I am also choosing to bold the gene name. In addition, lets get rid of the major and minor grid lines as they are taking away from the image aesthetics.

EnhancedVolcano(resdata_gene,
                lab = as.character(resdata_gene$symbol),
                x = 'log2FoldChange',
                y = 'padj', 
                selectLab = c('Ccr7', 'Cd19', 'Actn1', 'Gzma', 'Ccl5', 'Tcf7', 'Lef1'),
                title = "TCF7_KO versus WT",
                subtitle = "Differential expression", 
                FCcutoff = 1,
                pCutoff = 0.05,
                labSize = 4, 
                axisLabSize = 12, 
                col=c('black', 'black', 'black', 'red3'),
                labCol = 'black',
                labFace = 'bold',
                boxedLabels = TRUE,
                legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
                legendPosition = 'right',
                legendLabSize = 16,
                legendIconSize = 5.0,
                drawConnectors = TRUE,
                widthConnectors = 1.0,
                colConnectors = 'black', 
                gridlines.major = FALSE,
                gridlines.minor = FALSE)

ggsave("TCF7_KO_verus_WT-Enhancedvolano.png")

If you wanted to play around with colors (up versus down regulated) below is a script to help you get started.

keyvals <- ifelse(
  resdata_gene$log2FoldChange < -2, 'royalblue',
  ifelse(resdata_gene$log2FoldChange > 2, 'red',
         'black'))
keyvals[is.na(keyvals)] <- 'black'
names(keyvals)[keyvals == 'red'] <- 'Upregulated'
names(keyvals)[keyvals == 'black'] <- 'NA'
names(keyvals)[keyvals == 'royalblue'] <- 'Downregulated'

EnhancedVolcano(resdata_gene,
  lab = as.character(resdata_gene$symbol),
  x = 'log2FoldChange',
  y = 'padj', 
  selectLab = rownames(resdata_gene)[which(names(keyvals) %in%
                                             c('Upregulated','Downregulated'))],
  xlab = bquote(~Log[1]~ 'fold change'),
  colCustom = keyvals,
  title = "TCF7_KO versus WT",
  subtitle = "Differential expression", 
  FCcutoff = 1,
  pCutoff = 0.05,
  labSize = 4, 
  axisLabSize = 12,
  legendLabels=c('Not sig.','Log2FC','padj', 'padj & Log2FC'),
  legendPosition = 'left',
  legendLabSize = 16,
  legendIconSize = 5.0,
  gridlines.major = FALSE,
  gridlines.minor = FALSE)

Session info

sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] fstcore_0.9.14         genekitr_1.2.5         tidyr_1.3.0           
## [4] dplyr_1.1.2            EnhancedVolcano_1.14.0 ggrepel_0.9.4         
## [7] ggplot2_3.4.4         
## 
## loaded via a namespace (and not attached):
##   [1] shadowtext_0.1.2       fastmatch_1.1-4        systemfonts_1.0.5     
##   [4] plyr_1.8.9             igraph_1.4.2           lazyeval_0.2.2        
##   [7] splines_4.2.0          BiocParallel_1.32.6    usethis_2.2.2         
##  [10] GenomeInfoDb_1.34.9    urltools_1.7.3         digest_0.6.33         
##  [13] yulab.utils_0.0.6      htmltools_0.5.7        GOSemSim_2.22.0       
##  [16] viridis_0.6.4          GO.db_3.15.0           fansi_1.0.6           
##  [19] magrittr_2.0.3         memoise_2.0.1          remotes_2.4.2.1       
##  [22] openxlsx_4.2.5.2       Biostrings_2.66.0      graphlayouts_1.0.0    
##  [25] enrichplot_1.16.2      prettyunits_1.2.0      colorspace_2.1-0      
##  [28] blob_1.2.4             textshaping_0.3.7      xfun_0.41             
##  [31] callr_3.7.3            crayon_1.5.2           RCurl_1.98-1.13       
##  [34] jsonlite_1.8.8         scatterpie_0.2.1       ape_5.7-1             
##  [37] glue_1.6.2             polyclip_1.10-6        gtable_0.3.4          
##  [40] zlibbioc_1.44.0        XVector_0.38.0         pkgbuild_1.4.0        
##  [43] BiocGenerics_0.44.0    scales_1.2.1           DOSE_3.22.1           
##  [46] DBI_1.1.3              miniUI_0.1.1.1         Rcpp_1.0.11           
##  [49] viridisLite_0.4.2      xtable_1.8-4           progress_1.2.3        
##  [52] gridGraphics_0.5-1     tidytree_0.4.2         bit_4.0.5             
##  [55] europepmc_0.4.3        stats4_4.2.0           profvis_0.3.8         
##  [58] htmlwidgets_1.6.2      httr_1.4.7             fgsea_1.24.0          
##  [61] RColorBrewer_1.1-3     ellipsis_0.3.2         urlchecker_1.0.1      
##  [64] pkgconfig_2.0.3        farver_2.1.1           sass_0.4.6            
##  [67] utf8_1.2.4             labeling_0.4.3         ggplotify_0.1.2       
##  [70] tidyselect_1.2.0       rlang_1.1.2            reshape2_1.4.4        
##  [73] later_1.3.0            AnnotationDbi_1.60.0   munsell_0.5.0         
##  [76] tools_4.2.0            cachem_1.0.8           downloader_0.4        
##  [79] cli_3.6.2              generics_0.1.3         RSQLite_2.3.1         
##  [82] devtools_2.4.5         evaluate_0.23          stringr_1.5.0         
##  [85] fastmap_1.1.1          ragg_1.2.5             yaml_2.3.8            
##  [88] ggtree_3.4.4           processx_3.8.3         knitr_1.45            
##  [91] bit64_4.0.5            fs_1.6.3               tidygraph_1.2.3       
##  [94] zip_2.3.0              purrr_1.0.2            KEGGREST_1.38.0       
##  [97] ggraph_2.1.0           nlme_3.1-162           mime_0.12             
## [100] aplot_0.2.2            DO.db_2.9              ggvenn_0.1.10         
## [103] xml2_1.3.4             compiler_4.2.0         rstudioapi_0.15.0     
## [106] png_0.1-8              treeio_1.20.2          tibble_3.2.1          
## [109] tweenr_2.0.2           bslib_0.4.2            stringi_1.7.12        
## [112] highr_0.10             ps_1.7.5               lattice_0.22-5        
## [115] Matrix_1.5-4           vctrs_0.6.4            pillar_1.9.0          
## [118] lifecycle_1.0.4        triebeard_0.4.1        jquerylib_0.1.4       
## [121] data.table_1.14.10     cowplot_1.1.1          bitops_1.0-7          
## [124] httpuv_1.6.9           patchwork_1.1.3        qvalue_2.28.0         
## [127] R6_2.5.1               promises_1.2.1         gridExtra_2.3         
## [130] IRanges_2.32.0         sessioninfo_1.2.2      codetools_0.2-19      
## [133] pkgload_1.3.3          MASS_7.3-60            withr_2.5.2           
## [136] S4Vectors_0.36.2       GenomeInfoDbData_1.2.9 parallel_4.2.0        
## [139] hms_1.1.3              clusterProfiler_4.4.4  fst_0.9.8             
## [142] grid_4.2.0             ggfun_0.1.3            rmarkdown_2.25        
## [145] ggforce_0.4.1          Biobase_2.58.0         shiny_1.7.4           
## [148] geneset_0.2.7

Citation

title: “EnhancedVolcano: publication-ready volcano plots with enhanced colouring and labeling” author: “Kevin Blighe, Sharmila Rana, Myles Lewis”

Note: if you use Enhanced Volcano in published research, please cite:

Blighe K, Rana S, Lewis M (2022). EnhancedVolcano: Publication-ready volcano plots with enhanced colouring and labeling. R package version 1.16.0, https://github.com/kevinblighe/EnhancedVolcano.